Automatically enriching spoken corpora with syntactic information for linguistic studies
Abstract
Syntactic parsing of speech transcriptions is complicated by disfluencies, which break the syntactic structure of utterances. In this paper we propose two solutions to this problem. The first relies on a disfluency predictor that detects disfluencies and removes them prior to parsing. The second integrates the disfluencies into the syntactic structure of the utterances and trains a disfluency-aware parser.
Related resources
Detecting Annotation Errors in Spoken Language Corpora
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...
Comparative study of oral and written French automatically tagged with morpho-syntactic information
In this paper, we investigate automatic tagging of French corpora and compare morpho-syntactic properties of spoken and written language on corpora from different sources. Morpho-syntactic properties are first described according to the distribution of the 8 main POS in five corpora of about 1 million words each. The automatic tagging was made with about a hundred tags and we will describe the ...
How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies
The growing availability of spoken language corpora presents new opportunities for enriching the methodologies of speech and language therapy. In this paper, we present a novel approach for constructing speech motor exercises, based on linguistic knowledge extracted from spoken language corpora. In our study with the Dutch Spoken Corpus, syllabic inventories were obtained by means of automatic ...
Integrating Linguistic and Signal Knowledge in a Morpheme Based Speech Corpus Annotation Tool
As more and more speech systems require high-level linguistic knowledge to accommodate various levels of applications, corpora that are tagged with high-level linguistic annotations as well as signal-level annotations are highly recommended for development of today's speech systems. Among the high-level linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech cor...
Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora
There has been an increasing interest in recent years in the enrichment of natural language corpora in terms of annotation with explicit linguistic information. This interest manifests itself most prominently in two areas of linguistics: corpus linguistics and computational linguistics. For corpus linguistics, the long standing practice has been to work on raw, i.e., unannotated text. While raw...
Publication date: 2014